medical advice
Mind launches inquiry into AI and mental health after Guardian investigation
The Guardian revealed how people were being put at risk of harm by false and misleading health information in Google AI Overviews. Exclusive: England and Wales charity to examine safeguards after the Guardian exposed 'very dangerous' advice on Google AI Overviews. Mind is launching a significant inquiry into artificial intelligence and mental health after a Guardian investigation exposed how Google's AI Overviews gave people "very dangerous" medical advice. In a year-long commission, the mental health charity, which operates in England and Wales, will examine the risks and safeguards required as AI increasingly influences the lives of millions of people affected by mental health issues worldwide. The inquiry, the first of its kind globally, will bring together the world's leading doctors and mental health professionals, as well as people with lived experience, health providers, policymakers and tech companies.
- Europe > United Kingdom > Wales (0.45)
- Europe > United Kingdom > England (0.45)
- North America > United States (0.16)
- (2 more...)
Google puts users at risk by downplaying health disclaimers under AI Overviews
Google's AI Overviews only issue a warning if users choose to request additional health information by selecting 'Show more'. Google is putting people at risk of harm by downplaying safety warnings that its AI-generated medical advice may be wrong. When answering queries about sensitive topics such as health, the company says its AI Overviews, which appear above search results, prompt users to seek professional help rather than relying solely on its summaries. "AI Overviews will inform people when it's important to seek out expert advice or to verify the information presented," Google has said.
- Europe > Ukraine (0.06)
- Oceania > Australia (0.05)
- North America > United States > Massachusetts (0.05)
- Leisure & Entertainment > Sports (0.71)
- Health & Medicine > Consumer Health (0.58)
- Government > Regional Government (0.51)
- Media > News (0.50)
- Information Technology > Artificial Intelligence (1.00)
- Information Technology > Communications > Social Media (0.74)
- Information Technology > Information Management > Search (0.50)
Dr. Bias: Social Disparities in AI-Powered Medical Guidance
With the rapid progress of large language models (LLMs), the general public now has easy and affordable access to applications capable of answering most health-related questions in a personalized manner. These LLMs are increasingly competitive with, and in some medical capabilities now surpass, professionals. They hold particular promise in low-resource settings, since they offer the possibility of widely accessible, quasi-free healthcare support. However, the evaluations that fuel these motivations largely lack insight into the social nature of healthcare: they are oblivious to health disparities between social groups and to how bias may translate into LLM-generated medical advice and affect users. We provide an exploratory analysis of LLM answers to a series of medical questions spanning key clinical domains, where we simulate these questions being asked by several patient profiles that vary in sex, age range, and ethnicity. By comparing natural language features of the generated responses, we show that, when LLMs are used for medical advice generation, they produce responses that systematically differ between social groups. In particular, Indigenous and intersex patients receive advice that is less readable and more complex. These trends are amplified when intersectional groups are considered. Given the increasing trust individuals place in these models, we argue for higher AI literacy and for urgent investigation and mitigation by AI developers to ensure these systematic differences are diminished and do not translate into unjust patient support. Our code is publicly available on GitHub.
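To make the setup concrete, here is a minimal sketch of this kind of audit, assuming a hypothetical `ask_llm` client and using the `textstat` package for readability scoring. The profile attributes, question, and repetition count are illustrative, not the paper's actual protocol.

```python
# Minimal sketch (not the authors' code): ask the same medical question
# from different simulated patient profiles, then compare readability
# features of the generated answers across groups.
from itertools import product
from statistics import mean

import textstat  # pip install textstat

def ask_llm(prompt: str) -> str:
    # Placeholder: swap in a real chat-completion call for the model under audit.
    return "Chest pain can have many causes; please seek urgent medical care."

SEXES = ["female", "male", "intersex"]
AGES = ["25-34", "65-74"]
ETHNICITIES = ["white", "Black", "Indigenous"]
QUESTION = "What should I do about persistent chest pain?"
REPEATS = 5  # sample several completions per profile

scores: dict[tuple[str, str, str], list[float]] = {}
for sex, age, eth in product(SEXES, AGES, ETHNICITIES):
    prompt = f"I am a {age}-year-old {eth} {sex} patient. {QUESTION}"
    for _ in range(REPEATS):
        answer = ask_llm(prompt)
        # Flesch-Kincaid grade: higher means harder-to-read advice.
        scores.setdefault((sex, age, eth), []).append(
            textstat.flesch_kincaid_grade(answer)
        )

# Rank profiles by mean reading grade to surface systematic differences.
for profile, grades in sorted(scores.items(), key=lambda kv: -mean(kv[1])):
    print(profile, round(mean(grades), 2))
```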
- North America > United States > Alaska (0.04)
- North America > Canada > Quebec > Montreal (0.04)
- Europe > Denmark > Capital Region > Copenhagen (0.04)
- Asia > Thailand > Bangkok > Bangkok (0.04)
CounselBench: A Large-Scale Expert Evaluation and Adversarial Benchmarking of Large Language Models in Mental Health Question Answering
Li, Yahan, Yao, Jifan, Bunyi, John Bosco S., Frank, Adam C., Hwang, Angel, Liu, Ruishan
Medical question answering (QA) benchmarks often focus on multiple-choice or fact-based tasks, leaving open-ended answers to real patient questions underexplored. This gap is particularly critical in mental health, where patient questions often mix symptoms, treatment concerns, and emotional needs, requiring answers that balance clinical caution with contextual sensitivity. We present CounselBench, a large-scale benchmark developed with 100 mental health professionals to evaluate and stress-test large language models (LLMs) in realistic help-seeking scenarios. The first component, CounselBench-EVAL, contains 2,000 expert evaluations of answers from GPT-4, LLaMA 3, Gemini, and human therapists on patient questions from the public forum CounselChat. Each answer is rated across six clinically grounded dimensions, with span-level annotations and written rationales. Expert evaluations show that while LLMs achieve high scores on several dimensions, they also exhibit recurring issues, including unconstructive feedback, overgeneralization, and limited personalization or relevance. Responses were frequently flagged for safety risks, most notably unauthorized medical advice. Follow-up experiments show that LLM judges systematically overrate model responses and overlook safety concerns identified by human experts. To probe failure modes more directly, we construct CounselBench-Adv, an adversarial dataset of 120 expert-authored mental health questions designed to trigger specific model issues. Evaluation of 3,240 responses from nine LLMs reveals consistent, model-specific failure patterns. Together, CounselBench establishes a clinically grounded framework for benchmarking LLMs in mental health QA.
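The judge-versus-expert gap the abstract reports can be summarized with a simple aggregation. The sketch below assumes a flat record layout and uses placeholder scores purely to show the shape of the comparison; it is not CounselBench's released data or evaluation code.

```python
# Illustrative sketch: compare mean expert ratings with mean LLM-judge
# ratings per (model, dimension) to surface systematic overrating.
from collections import defaultdict

# Placeholder records for shape only:
# (model, dimension, expert_score, judge_score), scores on a 1-5 scale.
ratings = [
    ("gpt-4", "safety", 3.0, 4.5),
    ("gpt-4", "relevance", 4.0, 4.5),
    ("llama-3", "safety", 2.5, 4.0),
]

gap = defaultdict(list)
for model, dim, expert, judge in ratings:
    gap[(model, dim)].append(judge - expert)  # positive => judge overrates

for (model, dim), diffs in sorted(gap.items()):
    print(f"{model:10s} {dim:10s} judge-expert gap: {sum(diffs)/len(diffs):+.2f}")
```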
- North America > United States > California (0.14)
- North America > United States > Florida > Miami-Dade County > Miami (0.04)
- North America > United States > Virginia (0.04)
- (7 more...)
- Questionnaire & Opinion Survey (1.00)
- Research Report > Experimental Study (0.94)
Randy Travis' wife defied medical advice to 'pull the plug' during country star's stroke recovery battle
Randy Travis's wife, Mary, told Fox News Digital that doctors advised her to pull the plug after the country music star's stroke in 2013. Randy Travis's wife believes there was "never a doubt" in her husband's mind that he would make it through his debilitating stroke, even though doctors advised her to "pull the plug." During an interview with Fox News Digital, Mary described a crucial moment in Travis's two-and-a-half-year health battle that stood out to her: the moment doctors told her to end her husband's life. "I think Randy, there was never a doubt in Randy's mind that he could make it through it. It was that magical moment that I went to his bedside when they said, 'We need to pull the plug. He's got too many things going against him at that point.' He had gotten a staph infection and three other hospital-born bacterial viruses like Serratia, Pseudomonas, one thing after another, and the doctors were just saying, 'He just doesn't have the strength to get through this,'" Mary said.
- North America > United States > Tennessee > Davidson County > Nashville (0.06)
- North America > United States > District of Columbia > Washington (0.05)
Large language models provide unsafe answers to patient-posed medical questions
Draelos, Rachel L., Afreen, Samina, Blasko, Barbara, Brazile, Tiffany L., Chase, Natasha, Desai, Dimple Patel, Evert, Jessica, Gardner, Heather L., Herrmann, Lauren, House, Aswathy Vaikom, Kass, Stephanie, Kavan, Marianne, Khemani, Kirshma, Koire, Amanda, McDonald, Lauren M., Rabeeah, Zahraa, Shah, Amy
Millions of patients are already using large language model (LLM) chatbots for medical advice on a regular basis, raising patient safety concerns. This physician-led red-teaming study compares the safety of four publicly available chatbots--Claude by Anthropic, Gemini by Google, GPT-4o by OpenAI, and Llama3-70B by Meta--on a new dataset, HealthAdvice, using an evaluation framework that enables quantitative and qualitative analysis. In total, 888 chatbot responses are evaluated for 222 patient-posed advice-seeking medical questions on primary care topics spanning internal medicine, women's health, and pediatrics. We find statistically significant differences between chatbots. The rate of problematic responses varies from 21.6 percent (Claude) to 43.2 percent (Llama), with unsafe responses varying from 5 percent (Claude) to 13 percent (GPT-4o, Llama). Qualitative results reveal chatbot responses with the potential to lead to serious patient harm. This study suggests that millions of patients could be receiving unsafe medical advice from publicly available chatbots, and further work is needed to improve the clinical safety of these powerful tools.
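The headline comparison (21.6% versus 43.2% problematic responses, 222 questions per chatbot) can be sanity-checked with a standard two-proportion chi-square test. The counts below are reconstructed from the abstract's rates, so treat them as approximate rather than the study's exact figures.

```python
# Hedged sketch of the quantitative comparison the study describes:
# test whether problematic-response rates differ between the best and
# worst chatbots using a chi-square test on reconstructed counts.
from scipy.stats import chi2_contingency

N = 222  # advice-seeking questions per chatbot (from the abstract)
claude_bad = round(0.216 * N)  # 21.6% problematic (Claude) -> ~48
llama_bad = round(0.432 * N)   # 43.2% problematic (Llama)  -> ~96

table = [
    [claude_bad, N - claude_bad],
    [llama_bad, N - llama_bad],
]
chi2, p, dof, _ = chi2_contingency(table)
print(f"chi2={chi2:.2f}, dof={dof}, p={p:.4g}")  # small p => rates differ
```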
- North America > United States > California > San Francisco County > San Francisco (0.28)
- North America > United States > Virginia > Falls Church (0.04)
- North America > United States > Virginia > Albemarle County > Charlottesville (0.04)
- (11 more...)
Dr.Copilot: A Multi-Agent Prompt Optimized Assistant for Improving Patient-Doctor Communication in Romanian
Niculae, Andrei, Cosma, Adrian, Dumitrache, Cosmin, Rǎdoi, Emilian
Text-based telemedicine has become increasingly common, yet the quality of medical advice in doctor-patient interactions is often judged more on how advice is communicated than on its clinical accuracy. To address this, we introduce Dr. Copilot, a multi-agent large language model (LLM) system that supports Romanian-speaking doctors by evaluating and enhancing the presentation quality of their written responses. Rather than assessing medical correctness, Dr. Copilot provides feedback along 17 interpretable axes. The system comprises three LLM agents with prompts automatically optimized via DSPy. Designed with low-resource Romanian data and deployed using open-weight models, it delivers real-time, specific feedback to doctors within a telemedicine platform. Empirical evaluations and live deployment with 41 doctors show measurable improvements in user reviews and response quality, marking one of the first real-world deployments of LLMs in Romanian medical settings.
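As a rough illustration of the DSPy pattern the abstract names, a feedback agent can be declared as a signature with rating axes as output fields. The three axes, field names, and model choice below are assumptions, not Dr. Copilot's actual 17-axis configuration, and DSPy's API details vary across versions.

```python
# Assumed, simplified sketch of a DSPy feedback agent; field names and
# axes are illustrative, not Dr. Copilot's configuration.
import dspy

class ResponseFeedback(dspy.Signature):
    """Rate a doctor's written reply on presentation-quality axes (1-5)
    and suggest one concrete improvement. Does not judge clinical accuracy."""
    patient_question: str = dspy.InputField()
    doctor_reply: str = dspy.InputField()
    empathy: int = dspy.OutputField(desc="1-5 rating")
    clarity: int = dspy.OutputField(desc="1-5 rating")
    completeness: int = dspy.OutputField(desc="1-5 rating")
    suggestion: str = dspy.OutputField(desc="one actionable rewrite tip")

# Any chat model works here; an open-weight model matches the paper's setup.
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))
feedback = dspy.Predict(ResponseFeedback)
result = feedback(
    patient_question="My child has had a fever for three days.",
    doctor_reply="Give paracetamol.",
)
print(result.clarity, result.suggestion)
```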
- North America > United States > Florida > Miami-Dade County > Miami (0.04)
- Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.04)
- Europe > Romania > București - Ilfov Development Region > Municipality of Bucharest > Bucharest (0.04)
- (2 more...)
- Research Report (0.64)
- Overview (0.46)
Towards physician-centered oversight of conversational diagnostic AI
Vedadi, Elahe, Barrett, David, Harris, Natalie, Wulczyn, Ellery, Reddy, Shashir, Ruparel, Roma, Schaekermann, Mike, Strother, Tim, Tanno, Ryutaro, Sharma, Yash, Lee, Jihyeon, Hughes, Cían, Slack, Dylan, Palepu, Anil, Freyberg, Jan, Saab, Khaled, Liévin, Valentin, Weng, Wei-Hung, Tu, Tao, Liu, Yun, Tomasev, Nenad, Kulkarni, Kavita, Mahdavi, S. Sara, Guu, Kelvin, Barral, Joëlle, Webster, Dale R., Manyika, James, Hassidim, Avinatan, Chou, Katherine, Matias, Yossi, Kohli, Pushmeet, Rodman, Adam, Natarajan, Vivek, Karthikesalingam, Alan, Stutz, David
Recent work has demonstrated the promise of conversational AI systems for diagnostic dialogue. However, real-world assurance of patient safety means that providing individual diagnoses and treatment plans is a regulated activity reserved for licensed professionals. Furthermore, physicians commonly oversee other team members in such activities, including nurse practitioners (NPs) or physician assistants/associates (PAs). Inspired by this, we propose a framework for effective, asynchronous oversight of the Articulate Medical Intelligence Explorer (AMIE) AI system. We propose guardrailed-AMIE (g-AMIE), a multi-agent system that performs history taking within guardrails, abstaining from individualized medical advice. Afterwards, g-AMIE conveys assessments to an overseeing primary care physician (PCP) in a clinician cockpit interface. The PCP provides oversight and retains accountability for the clinical decision. This effectively decouples oversight from intake, allowing it to happen asynchronously. In a randomized, blinded virtual Objective Structured Clinical Examination (OSCE) of text consultations with asynchronous oversight, we compared g-AMIE to NPs/PAs or a group of PCPs under the same guardrails. Across 60 scenarios, g-AMIE outperformed both groups in performing high-quality intake, summarizing cases, and proposing diagnoses and management plans for the overseeing PCP to review. This resulted in higher quality composite decisions. PCP oversight of g-AMIE was also more time-efficient than standalone PCP consultations in prior work. While our study does not replicate existing clinical practices and likely underestimates clinicians' capabilities, our results demonstrate the promise of asynchronous oversight as a feasible paradigm for diagnostic AI systems to operate under expert human oversight for enhancing real-world care.
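A conceptual sketch of the guardrail-and-routing pattern described here, not g-AMIE's implementation: drafts that cross into individualized diagnosis or treatment are withheld from the patient and queued for the overseeing physician, while history-taking questions pass through. The patterns and wording are illustrative.

```python
# Conceptual sketch of decoupled oversight: the intake agent may ask
# questions and summarize, but drafts containing individualized advice
# are routed to the overseeing physician instead of the patient.
import re

ADVICE_PATTERNS = [
    r"\byou (likely |probably )?have\b",                       # individualized diagnosis
    r"\b(take|start|stop|increase)\b.*\b(mg|dose|tablet)\b",   # treatment plan
    r"\bI recommend\b",
]

def violates_guardrail(draft: str) -> bool:
    """True if a draft reply crosses into regulated, individualized advice."""
    return any(re.search(p, draft, re.IGNORECASE) for p in ADVICE_PATTERNS)

def route(draft: str, pcp_queue: list[str]) -> str:
    if violates_guardrail(draft):
        pcp_queue.append(draft)  # physician reviews and retains accountability
        return ("I've noted this for the doctor reviewing your case; "
                "they will follow up on diagnosis and treatment.")
    return draft  # history-taking questions and summaries pass through

queue: list[str] = []
print(route("You likely have strep throat; take amoxicillin 500 mg.", queue))
print(route("How long have you had the sore throat?", queue))
print(f"drafts awaiting physician review: {len(queue)}")
```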
- North America > United States > Hawaii > Honolulu County > Honolulu (0.04)
- North America > Canada (0.04)
- Europe > United Kingdom > England (0.04)
- (3 more...)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
- Research Report > Strength High (0.92)
AI companies have stopped warning you that their chatbots aren't doctors
"Then one day this year," Sharma says, "there was no disclaimer." Curious to learn more, she tested generations of models introduced as far back as 2022 by OpenAI, Anthropic, DeepSeek, Google, and xAI--15 in all--on how they answered 500 health questions, such as which drugs are okay to combine, and how they analyzed 1,500 medical images, like chest x-rays that could indicate pneumonia. The results, posted in a paper on arXiv and not yet peer-reviewed, came as a shock--fewer than 1% of outputs from models in 2025 included a warning when answering a medical question, down from over 26% in 2022. Just over 1% of outputs analyzing medical images included a warning, down from nearly 20% in the earlier period. To seasoned AI users, these disclaimers can feel like formality--reminding people of what they should already know, and they find ways around triggering them from AI models.
A Systematic Analysis of Declining Medical Safety Messaging in Generative AI Models
Sharma, Sonali, Alaa, Ahmed M., Daneshjou, Roxana
Generative AI models, including large language models (LLMs) and vision-language models (VLMs), are increasingly used to interpret medical images and answer clinical questions. Their responses often include inaccuracies; therefore, safety measures like medical disclaimers are critical to remind users that AI outputs are not professionally vetted or a substitute for medical advice. This study evaluated the presence of disclaimers in LLM and VLM outputs across model generations from 2022 to 2025. Using 500 mammograms, 500 chest X-rays, 500 dermatology images, and 500 medical questions, outputs were screened for disclaimer phrases. Medical disclaimer presence in LLM and VLM outputs dropped from 26.3% in 2022 to 0.97% in 2025, and from 19.6% in 2023 to 1.05% in 2025, respectively. By 2025, the majority of models displayed no disclaimers. As public models become more capable and authoritative, disclaimers must be implemented as a safeguard adapting to the clinical context of each output.
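The screening step lends itself to a simple phrase-matching pass over model outputs. The sketch below uses an illustrative pattern list, which may differ from the paper's actual screening criteria.

```python
# Hedged sketch of disclaimer screening: flag whether a model output
# contains a medical-disclaimer phrase, then compute the disclaimer rate.
import re

DISCLAIMER_PATTERNS = [
    r"not (a|an) (doctor|medical professional|physician)",
    r"not (medical|professional) advice",
    r"consult (a|your) (doctor|physician|healthcare provider)",
    r"seek (professional|medical) (help|advice|attention)",
]

def has_disclaimer(output: str) -> bool:
    return any(re.search(p, output, re.IGNORECASE) for p in DISCLAIMER_PATTERNS)

# Illustrative outputs; in the study these would be thousands of LLM/VLM responses.
outputs = [
    "The X-ray shows possible consolidation in the right lower lobe.",
    "This could be pneumonia, but I'm not a doctor; consult your physician.",
]
rate = sum(map(has_disclaimer, outputs)) / len(outputs)
print(f"disclaimer rate: {rate:.1%}")
```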
- North America > United States > California > San Francisco County > San Francisco (0.28)
- North America > United States > California > Alameda County > Berkeley (0.14)
- North America > United States > California > Santa Clara County > Stanford (0.05)
- (3 more...)
- Health & Medicine > Therapeutic Area (0.95)
- Health & Medicine > Diagnostic Medicine > Imaging (0.70)